Oak Ridge National Laboratory’s Journey from Petascale to Exascale


Mission: Providing world-class computational resources and specialized services for the most computationally intensive global challenges.

Vision: Deliver transforming discoveries in energy technologies, materials, biology, environment, health, etc.

Energy Efficient Computing: Frontier achieves 14.5 MW per EF

Since 2009 the biggest concern with reaching Exascale has been energy consumption.

Frontier: first US Exascale computer. Multiple GPUs per CPU drove energy efficiency.

GPUs per CPU
Jaguar      none
Titan       1
Summit      3
Frontier    4

• ORNL pioneered GPU use in supercomputing, beginning in 2012 with Titan and continuing through today with Frontier. GPUs account for a significant part of the energy efficiency improvements.
• ASCR [Fast, Design, Path] Forward vendor investments in energy efficiency (2012-2020) further reduced the power consumption of computing chips (CPUs and GPUs).
• 200x reduction in energy per FLOPS from Jaguar to Frontier at ORNL.
• ORNL achieves additional energy savings from using water cooling in Frontier (32 C).

Exascale made possible by 200x improvement in energy efficient computing.

ORNL Data Center PUE = 1.03

[Chart: power per exaflop by system. Recovered labels: Jaguar 3,043 MW/EF; 330 MW/EF; 2017; 2021]
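
The efficiency figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes the 14.5 MW per EF headline is Frontier's 29 MW power consumption divided by its 2 EF peak (both taken from the overview slide), and the helper name `mw_per_exaflop` is hypothetical.

```python
# Figures quoted on the slides; nothing here is measured independently.
JAGUAR_MW_PER_EF = 3043.0   # Jaguar power scaled to one exaflop
FRONTIER_POWER_MW = 29.0    # Frontier system power consumption
FRONTIER_PEAK_EF = 2.0      # Frontier peak double-precision exaflops


def mw_per_exaflop(power_mw: float, peak_ef: float) -> float:
    """Power needed per exaflop of peak performance (hypothetical helper)."""
    return power_mw / peak_ef


frontier_mw_per_ef = mw_per_exaflop(FRONTIER_POWER_MW, FRONTIER_PEAK_EF)
improvement = JAGUAR_MW_PER_EF / frontier_mw_per_ef

print(f"Frontier: {frontier_mw_per_ef:.1f} MW/EF")
print(f"Improvement over Jaguar: {improvement:.0f}x")
```

Under these assumptions the ratio comes out at roughly 210, which is consistent with the rounded "200x" quoted on the slide.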


On our journey to Exascale, we found an architecture that could 
excel at simulation, data analytics, and artificial intelligence 


As supercomputers got larger and larger, we expected them to become more specialized and limited to just a small number of applications that can exploit their growing scale.

We found that the “Summit” architecture, with few, large-memory, multi-GPU nodes, excels at:

• Data analytics: CoMet bioinformatics application for comparative genomics. Has achieved 2.36 ExaOps mixed precision (FP16-FP32) on Summit (2018 Gordon Bell Winner).
• Deep Learning: Climate: a neural network learns to detect extreme global weather patterns. Has achieved a sustained throughput of 1.0 ExaOps (FP16) on Summit.

The Frontier Exascale computer uses and improves on Summit's successful architecture
• 5 TB of on-node memory, 4 GPUs per node, peak of >10 ExaOps (FP16)

Frontier Overview: Built by HPE, Powered by AMD

Extraordinary Engineering

Olympus rack
• 128 AMD nodes
• 8,000 lbs
[Rack photo labels: Power Distribution Unit; Power & water]

AMD node
• 1 AMD “Trento” CPU
• 4 AMD MI250X GPUs
• 512 GiB DDR4 memory on CPU
• 512 GiB HBM2e total per node (128 GiB HBM per GPU)
• Coherent memory across the node
• 4 TB NVM
• GPUs & CPU fully connected with AMD Infinity Fabric
• 4 Cassini NICs, 100 GB/s network BW

System
• 2 EF Peak DP FLOPS
• 74 compute racks
• 29 MW Power Consumption
• 9,408 nodes
• 9.2 PB memory (4.6 PB HBM, 4.6 PB DDR4)
• Cray Slingshot network with dragonfly topology
• 37 PB Node Local Storage
• 716 PB Center-wide storage
• 4,000 ft² footprint

Compute blade
• 2 AMD nodes

All water cooled, even DIMMs and NICs
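
The system-level memory and storage totals follow from the per-node figures. The sketch below reproduces them under one stated assumption: the memory totals on the slide are read as binary petabytes (PiB), which is how the per-node GiB figures add up to 4.6 and 9.2, while node-local NVM is treated as decimal TB and PB. All variable names are illustrative.

```python
# Per-node figures from the Frontier overview slide.
NODES = 9408
GPUS_PER_NODE = 4
HBM_PER_GPU_GIB = 128
DDR4_PER_NODE_GIB = 512
NVM_PER_NODE_TB = 4

GIB_PER_PIB = 1024 ** 2  # assumption: memory "PB" on the slide means PiB

hbm_per_node_gib = GPUS_PER_NODE * HBM_PER_GPU_GIB
hbm_total_pib = NODES * hbm_per_node_gib / GIB_PER_PIB
ddr4_total_pib = NODES * DDR4_PER_NODE_GIB / GIB_PER_PIB
memory_total_pib = hbm_total_pib + ddr4_total_pib

# Node-local storage, using decimal units (1 PB = 1000 TB).
nvm_total_pb = NODES * NVM_PER_NODE_TB / 1000

total_gpus = NODES * GPUS_PER_NODE

print(f"HBM per node:  {hbm_per_node_gib} GiB")
print(f"HBM total:     {hbm_total_pib:.1f} PiB")
print(f"DDR4 total:    {ddr4_total_pib:.1f} PiB")
print(f"Memory total:  {memory_total_pib:.1f} PiB")
print(f"Node-local:    {nvm_total_pb:.1f} PB")
print(f"GPUs:          {total_gpus}")
```

With these inputs the memory totals round to 4.6 and 9.2, and node-local storage comes to about 37.6 PB, in line with the 37 PB quoted on the slide.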

Frontier multi-tier storage system is designed to excel at Data Science and AI for Scientific Discovery

Multi-tier I/O Subsystem

Tier                          Capacity   Read        Write
Node Local Storage (flash)    37 PB      65.9 TB/s   62.1 TB/s   (11 billion IOPS)
Performance tier              11 PB      9.4 TB/s    9.4 TB/s
Capacity tier                 695 PB     5.2 TB/s    4.4 TB/s
Metadata                      10 PB      2M transactions per sec

[Photo labels: Gazelle SSD storage board; Moose HDD storage board (capacity tier)]
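
As a rough consistency check on the storage figures, the sketch below sums the three center-wide tiers and spreads the node-local bandwidth across the nodes. It assumes decimal units (1 TB/s = 1000 GB/s) and that the quoted local bandwidth is an aggregate divided evenly over all 9,408 nodes. Neither assumption is stated on the slide.

```python
NODES = 9408  # node count from the system overview

# Center-wide tiers, capacities in PB as listed in the table.
tiers_pb = {"performance": 11, "capacity": 695, "metadata": 10}
center_wide_pb = sum(tiers_pb.values())

# Aggregate node-local bandwidth in TB/s.
local_read_tb_s = 65.9
local_write_tb_s = 62.1

# Assumed: even split across nodes, decimal units.
read_gb_s_per_node = local_read_tb_s * 1000 / NODES
write_gb_s_per_node = local_write_tb_s * 1000 / NODES

print(f"Center-wide storage: {center_wide_pb} PB")
print(f"Local read per node:  {read_gb_s_per_node:.1f} GB/s")
print(f"Local write per node: {write_gb_s_per_node:.1f} GB/s")
```

The tier capacities add up to the 716 PB of center-wide storage quoted in the system overview, and the per-node share works out to about 7.0 GB/s read and 6.6 GB/s write.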

During Frontier Build, the Chip Shortage Hit in Earnest!


When HPE began ordering parts, suppliers said the lead time on orders was increasing by an additional 6-12 months.

ORNL worked with ASCR to get a DPAS rating for Frontier, which helped prioritize USA part orders (DPAS was extended to Aurora and El Capitan).


60 Million parts needed for Frontier 
685 Different part numbers used in Frontier 
167 Frontier part numbers affected by the chip shortage 
(more than 2 million parts from dozens of suppliers worldwide) 
12 Part numbers blocked building the first compute cabinet 
15 Part numbers in short supply for AMD to build all the MI200 cards for Frontier


It wasn't exotic parts like CPUs or GPUs, but rather parts needed by everyone in cars, TVs, and electronics, such as voltage regulators, oscillators, and power modules.

Supply Chain Remained a Constant Battle till Delivery


HPE saw sub-contractors' commitments for parts deliveries being broken weekly as the chip shortage got worse. HPE had to call every supplier every week (sometimes every day).

HPE had 15 people whose sole job was to find the needed parts or alternatives for Frontier. They used HPE’s size to negotiate with suppliers and looked for handfuls of parts in warehouses or at other companies that were also stuck because of the chip shortage.

April 30 to July 15: Initial shortage of 167 part numbers reduced to 1 part number

• On July 15, only enough had been found to build 63 of 74 cabinets (still looking for about 8,000 more)
• It took three more weeks to find all 8,000
• By that time there were a couple more decommits on another part

The final parts arrived the morning the last Frontier node was assembled.

Last Cabinet of Frontier Delivered to ORNL October 18th
Thanks to Heroic Efforts of the HPE and AMD teams

[Photo: last cabinet being rolled into place. (Each cabinet ...)]
[Photo labels: Crusher and Borg Test & Development Systems (2); Over 716 PB storage; Compute footprint 4,000 ft² (74 cabinets); Max Power; ORNL Facility PUE 1.03]

After the cabinets arrived they had to be connected. There are 81,000 cables between all the Frontier nodes.


Getting Frontier Ready for Early Science

• As we saw with Titan and Summit, it takes a number of months to get all the hardware and software stabilized
• HPE continues testing and stabilization of Frontier and its file system
• Early Science Teams in CAAR and ECP got access to the “Crusher” Test & Development system in November 2021
• Rest of ECP users (~800) given Crusher access January 2022

ECP is scheduled for full Frontier access July 2022

INCITE use of Frontier scheduled for January 2023

“Crusher” TDS system
• 2 cabinets of Frontier HW
• 192 nodes
• Slingshot 11 w/ Cassini
• Same software as Frontier


Getting Users Ready for Early Science 


Crusher Training, January 13, 2022
• HPE and AMD presented Frontier architecture details, programming environment features, and tips and tricks; ORNL provided login instructions and a Crusher Quick Start Guide


Two Crusher Hackathons in February for CAAR and ECP Early Science Teams 


February 9-11, 2022 


NAMD 
LSMS 
CoMet 
GESTS
NuCCOR 
PIConGPU 
LBPM 
ExaBiome 
FUN3D 


February 15-17, 2022 


ExaStar 
LatticeQCD 
NWChemEx
GAMESS 
PELE 
ExaSMR 
WDMApp 
E3SM 
ExaAM 


Quotes from Hackathon
“Great interactions with HPE and AMD staff in resolving issues”
“Reduced unit runtime from 8 hr. to 11 min.”
“Quick help and learned some useful tricks and tips”
“Got 4x speedup on ExaAM PicassoMPM code”

Initial CAAR Early Science Results on Crusher

Science Area         CAAR App    Recent Results on Crusher
Advanced materials   LSMS        MI250 getting up to 10x speedup over Summit V100
Turbulent Flows      GESTS       Crusher GCD* achieves 6x speedup over Summit V100
Porous Media         LBPM        Crusher GCD slightly faster than Summit V100
Plasma Physics       PIConGPU    Seeing 2.5x to 5x speedup over Summit
Atomic nucleus       NuCCOR      Crusher MI250 performance gains of up to 8x over Summit V100
Health               CoMet       Has been run on Frontier up to 3,210 nodes
Astrophysics         Cholla      Total of 15x speedup: Crusher HW gives an additional 3x over Summit, multiplied by 5x from SW

* MI250 GPU is composed of two GCDs

[Images: FePt nanoparticle; magnetic anisotropy; turbulent flows; properties of atoms; astrophysics hydrodynamics]

Progress on Crusher by ECP KPP-1 Applications

Apps selected to demonstrate performance improvement for mission-critical problems

Domain                      Application   Status
Quantum Chromodynamics      LatticeQCD    Improving Performance
Chemistry (Biofuels)        NWChemEx      Initial Build/Test
Extreme Materials (MD)      EXAALT        Improving Performance
Quantum Materials (QMC)     QMCPACK       Blocked (MPI)
Nuclear Reactors (SMRs)     ExaSMR        Improving Performance
Fusion Plasmas              WDMApp        Improving Performance
Particle accelerators       WarpX         Improving Performance
Cosmology                   ExaSky        Improving Performance
Earthquakes                 EQSIM         Improving Performance
Climate Change              E3SM-MMF      Improving Performance
Cancer Research             CANDLE        Improving Performance

Progress on Crusher by ECP KPP-2 Applications

Apps selected to broaden the reach of exascale science and mission capability

Domain                      Application   Status
Catalysis                   GAMESS        Blocked (ROCm 5.x)
Additive Manufacturing      ExaAM         Improving Performance
Wind Energy                 ExaWind       Improving Performance
Combustion                  PELE          Improving Performance
Carbon Capture              MFIX-Exa      Improving Performance
Astrophysics                ExaStar       Improving Performance
Subsurface                  Subsurface    Improving Performance
Energy Grid                 ExaSGD        Improving Performance
Metagenomics                ExaBiome      Blocked (GASNet)
LCLS Molecular Structure    ExaFEL        Improving Performance


ECP Application Portfolio: Early Science runs on Frontier

Climate Change
• Subsurface use for carbon capture, petroleum extraction, waste disposal
• Accurate regional impact assessments in Earth system models
• Stress-resistant crop analysis and catalytic conversion of biomass-derived alcohols
• Metagenomics for analysis of biogeochemical cycles, climate change, environmental remediation

Energy security
• Reliable and efficient planning of the power grid
• Turbine wind plant efficiency
• Design and commercialization of Small Modular Reactors
• Nuclear fission and fusion reactor materials design
• High-efficiency, low-emission combustion engine and gas turbine design

Health care
• Accelerate and translate cancer research (partnership with NIH)

Scientific discovery / Economic security
• Cosmological probe of the standard model of particle physics
• Validate fundamental laws of nature
• Find, predict, and control materials and properties
• Light source-enabled analysis of protein and molecular structure and design
• Predict and control magnetically confined fusion plasmas
• Demystify origin of chemical elements
• Additive manufacturing of qualifiable metal parts
• Scale up of clean fossil fuel combustion
• Biofuel catalyst design
• Seismic hazard risk assessment

[Image caption: Developing AI for Precision Drug Therapy in Fight Against Cancer]

Once its KPP goal is achieved, an ECP team can graduate to an Early Science allocation on Frontier and begin using its ECP codes for Early Science results.

Questions?